CLUSTER INDEPENDENT ANNOTATION

Introduction

Cluster Independent Annotation (CIA) is a classification method meant to help researchers in the cell annotation step of scRNA-seq experiments. Exploiting two functions, given cell type signatures as input, this classifier computes a signature score for each cell (_signaturescore) and (with different modalities) it compares the score values in order to assign a label to each single cell (_signature_basedclassification).

This tool provides several advantages:

  • It synthesizes the information of a whole signature expression in a single score value, skipping the tedious one by one inspection of marker genes taken from long differentially expressed genes (DEGs) lists, which furthermore may not be cluster specific when taken singularly.

  • It provides a classification for each cell that is completely independent from clustering, thus, it can be used in parallel with a clustering method in order to set a proper resolution value, obtaining so coherent and easy to annotate cell groups.

  • Given the implemented modalities, it can be very fast and classify a huge dataset (hundreds of thousands cell) in a bunch of seconds, or it can be statistically more reliable exploiting the comparison of obtained signature scores with randomic signature ones. Since the latter method is of course computationally heavier, we implemented the possibility to parallelize processes.

  • Being signature-based, this tool can provide information about any kind of gene list with a biological meaning, allowing also functional annotation.

  • Normalizing for the gene signature length, it enables the comparison of genesets with different length, spanning from tens to thousands genes (obviously with different significance).

CIA is composed by two main modules:

  • investigate: a module which contains the functions to compute signature scores and automatically annotate cells accordingly (functions: signature_score and signature_based_classification)
  • report: a module which contains the functions to visualize score distributions in cell groups and classification performances with respect to a reference annotation (functions: group_composition, classification_metrics, grouped_classification_metrics and grouped_distributions)
In [1]:
import numpy as np
import pandas as pd
import scanpy as sc
import seaborn as sns
import multiprocessing
from functools import partial
from scipy.sparse import issparse
from scipy import sparse
import time
from sklearn import metrics
from scipy import sparse
import itertools
import matplotlib.pyplot as plt
In [2]:
from cia import investigate, report, utils

Description

A representative workflow is shown in this tutorial, in which we exploited CIA functions to automatically annotate PBMC3K scRNA-seq datasets at cellular level, starting from a set of gene signatures extracted from a PBMC atlas obained with CITE-seq method.

datasets.png

Our method requires as input an AnnData object with a raw expression matrix (AnnData.raw.X) and a dictionary containing the names of the signatures (e.g. cell type, cell state ...) as keys and correspondent gene names as values. With the intent of standardize the analysis, we strongly suggest to use AnnData objects preprocessed following those Scanpy tutorial steps:

  • standard filtering on cells and genes.
  • setting normalized and logarithmized raw gene expression values as .raw attribute.
  • subsetting the AnnData with highly variable genes.

Extracting gene signatures from PBMC atlas

For demonstration purposes, we decided to use as signatures the DEGs of Hao et al., 2021 [1] PBMC atlas. This dataset clusters have been confidently annotated exploiting “weighted-nearest neighbor” framework, an integrative analysis which takes into account both RNA and protein level information. This approach ensures that the extracted gene lists (RNA-based only) are actually associated to a specific cell type.

In [4]:
# To download the dataset (2.536 GB) 
!wget https://datasets.cellxgene.cziscience.com/b0381820-6536-487a-85d2-b5994ae0f1c8.h5ad -O data/b0381820-6536-487a-85d2-b5994ae0f1c8.h5ad


# to read the atlas data
atlas= sc.read('data/b0381820-6536-487a-85d2-b5994ae0f1c8.h5ad')
atlas
Out[4]:
AnnData object with n_obs × n_vars = 161764 × 20729
    obs: 'nCount_ADT', 'nFeature_ADT', 'nCount_RNA', 'nFeature_RNA', 'orig.ident', 'lane', 'donor', 'time', 'celltype.l1', 'celltype.l2', 'celltype.l3', 'Phase', 'nCount_SCT', 'nFeature_SCT', 'Cell type'
    var: 'features'
    uns: 'celltype.l1_colors', 'celltype.l2_colors', 'celltype.l3_colors', 'neighbors'
    obsm: 'X_apca', 'X_aumap', 'X_pca', 'X_spca', 'X_umap', 'X_wnn.umap'
    varm: 'PCs', 'SPCA'
    obsp: 'distances'
In [5]:
# to start from a count matrix
atlas.X = atlas.layers['corrected_counts']
atlas.X.todense()
Out[5]:
matrix([[0., 1., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 1., 0., ..., 0., 0., 0.],
        [0., 0., 1., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]])
In [6]:
# to normalize the data
sc.pp.normalize_total(atlas, target_sum=1e4)
sc.pp.log1p(atlas)
# to set AnnData.raw attribute
atlas.raw= atlas

This dataset has been annotated at 3 different levels of granularity. We focused on the coarser one, both for visualization purposes and to facilitate the comparisons with other datasets annotated at lower resolution.

In [7]:
sc.pl.umap(atlas, color='celltype.l1')
In [8]:
sc.pl.umap(atlas, color=['celltype.l2', 'celltype.l3'], wspace=1)

Since labels 'other' and 'other T' don't refer to any cell type in particular, we checked their identity in the higher resolution annotation.

In [9]:
ax=sc.pl.umap(atlas, show=False )
sc.pl.umap(atlas[(atlas.obs['celltype.l2']=='gdT') | (atlas.obs['celltype.l2']=='MAIT') | 
                 (atlas.obs['celltype.l2']=='dnT') | (atlas.obs['celltype.l2']=='Platelet')], color='celltype.l2', ax=ax)

'other T' cluster is composed by double negative T cells, gamma delta T cells and MAIT cells. Those cells are not annotaded in PBMC3K, therefore there is no way to determine wether their eventual finding is correct or not. Since our intent is to use PBMC3K as test dataset, we removed 'other T' cells from the analysis. 'other' cluster is mainly composed by platelets and for this reason has been renamed 'Platelet'.

In [10]:
# to remove 'other T' cells
atlas=atlas[atlas.obs['celltype.l1']!='other T']
In [11]:
atlas.obs['Cell type']=atlas.obs['celltype.l1']
# to rename clusters
atlas.obs['Cell type']=atlas.obs['Cell type'].cat.rename_categories(['B', 'CD4 T', 'CD8 T', 'DC', 'Mono', 'NK', 'Platelet'])
atlas.uns['Cell type_colors']=['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b', '#e377c2']
atlas.obs['Cell type'].cat.categories
Out[11]:
Index(['B', 'CD4 T', 'CD8 T', 'DC', 'Mono', 'NK', 'Platelet'], dtype='object')

We performed differential expression analysis (DEA) on coarser resoluted clusters and we selected as DEGs genes having at least 1.5 log2FC, with minimal mean of expression 0.25, z-score 5 and expressed at least in 40% of cells within the cluster, in order to obtain shorter but more specific gene lists than a usual DEA.

In [12]:
atlas
Out[12]:
AnnData object with n_obs × n_vars = 154975 × 20729
    obs: 'nCount_ADT', 'nFeature_ADT', 'nCount_RNA', 'nFeature_RNA', 'orig.ident', 'lane', 'donor', 'time', 'celltype.l1', 'celltype.l2', 'celltype.l3', 'Phase', 'nCount_SCT', 'nFeature_SCT', 'Cell type'
    var: 'features'
    uns: 'celltype.l1_colors', 'celltype.l2_colors', 'celltype.l3_colors', 'neighbors', 'log1p', 'Cell type_colors'
    obsm: 'X_apca', 'X_aumap', 'X_pca', 'X_spca', 'X_umap', 'X_wnn.umap'
    varm: 'PCs', 'SPCA'
    obsp: 'distances'
In [13]:
sc.tl.rank_genes_groups(atlas, groupby='Cell type')
WARNING: Default of the method has been changed to 't-test' from 't-test_overestim_var'
In [14]:
# to filter differentially express genes with cia.utils.filter_degs()
gmt=utils.filter_degs(atlas, groupby='Cell type', uns_key='rank_genes_groups', logFC=1.5,  scores=5, perc=40,
                mean=0.25, direction='up')
for i in gmt.keys():
    print(i + ' signature has '+ str(len(gmt[i])) +' genes')
B signature has 115 genes
CD4 T signature has 43 genes
CD8 T signature has 22 genes
DC signature has 156 genes
Mono signature has 674 genes
NK signature has 130 genes
Platelet signature has 132 genes

We then assessed the overlap among lists and, despite the stringent DEA, we found that CD8 T and CD4 T cells signatures show a similarity that is remarkably higher than similarities among all the other signatures (50% of DEGs shared with CD4 T). The authors themselves claimed that those populations are difficult to be effectively discriminated by scRNA-seq alone, in particular CD8 T and CD4 T cells. Citing them: "We found that for CD8+ T cells, the most similar RNA neighbors often reflected a mix of CD8+ and CD4+ T cells (in the RNA KNN graph, there are a total of 944 incorrect edges that connect CD8+ to CD4+ T cells). By contrast, protein neighbors were predominantly correctly identified as CD8+ T cells (in the protein KNN graph, 12 CD8+/CD4+ edges were identified)" [1]. For this reason we can't consider those signatures completely specific and so we don't expect to obtain an ideal classification for those cell types.

In [15]:
# to check gene lists similarity with cia.utils.signatures_similarity()
utils.signatures_similarity(gmt, show='J')
Out[15]:
B CD4 T CD8 T DC Mono NK Platelet
B 1.000000 0.006369 0.000000 0.097166 0.019380 0.004098 0.008163
CD4 T 0.006369 1.000000 0.203704 0.000000 0.000000 0.000000 0.000000
CD8 T 0.000000 0.203704 1.000000 0.000000 0.000000 0.034014 0.006536
DC 0.097166 0.000000 0.000000 1.000000 0.080729 0.007042 0.014085
Mono 0.019380 0.000000 0.000000 0.080729 1.000000 0.005000 0.028061
NK 0.004098 0.000000 0.034014 0.007042 0.005000 1.000000 0.023438
Platelet 0.008163 0.000000 0.006536 0.014085 0.028061 0.023438 1.000000

Since signature refinement is not the intent of this package and of this tutorial, and it should be done before the annotation/classification step, we proceeded using the above described signatures to show the usages of the proposed functions, and to report the automatic annotation performances.

Test dataset

In order to evaluate both the consistency of our method and the performances of classification, we used the PMBC atlas DEGs to automatically annotate the PBMC3K dataset from Satija et al., 2015 [3]. This dataset was annotated by the authors relying on clustering and marker genes expression inspection and it is widely used as reference in the scientific community. We classified this dataset independently from the already present annotation, whose cell labels were used as ground truth to evaluate our classification perfomances with different modalities.

In [16]:
# to load the test dataset
pbmc3k=sc.read('data/pbmc3k.h5ad')

# to normalize and logarithmize the values
sc.pp.normalize_total(pbmc3k, target_sum=1e4)
sc.pp.log1p(pbmc3k)

# to set the .raw attribute
pbmc3k.raw=pbmc3k

# to compute HVG
sc.pp.highly_variable_genes(pbmc3k, min_mean=0.0125, max_mean=3, min_disp=0.5)

# to subset features
pbmc3k=pbmc3k[:, pbmc3k.var.highly_variable]
pbmc3k
Out[16]:
View of AnnData object with n_obs × n_vars = 2638 × 1838
    obs: 'n_genes', 'percent_mito', 'n_counts', 'louvain', 'Cell type'
    var: 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'Cell type_colors', 'neighbors', 'log1p', 'hvg'
    obsm: 'X_draw_graph_fr', 'X_pca', 'X_tsne', 'X_umap'
    obsp: 'connectivities', 'distances'
In [17]:
pbmc3k.uns['Cell type_colors']=['#b2df8a', '#fb9a99', '#1f78b4', '#E4D00A', '#6fadfd', '#d62728','#FF5733']

For the classification of the test dataset we renamed and merged some clusters in order to make easier the comparison and the visualization of results. In particular, ‘CD14+ Monocytes' and 'FCGR3A+ Monocytes' clusters were merged into ‘Mono’.

In [18]:
sc.pl.umap(pbmc3k, color=['louvain','Cell type'], wspace=0.6)

Signature score

signature_score function is based on "gene signature score" calculation method shown in Della Chiara, Gervasoni, Fakiola, Godano et al., 2021 [2]. The "gene signature score" of a signature S in a cell C is defined as follows:

${\bf{GSS(C)}}=\frac{{n}}{L}\frac{{\sum }^n_{i=1}{\bf{X}}i\: for\:{\bf{G}}i \:in \:S}{{\sum }^m_{j=1}{\bf{X}}j\: for\:{\bf{G}}j \:in \:G}$

Where:

  • n is the number of genes in S that are expressed in C;
  • G={𝐺1, 𝐺2, …, 𝐺m} is the list of length m that contains all the gene symbols of the C;
  • S={𝐺1, 𝐺2, …, 𝐺L} is the list of gene symbols of length L that compose the signature;
  • X={𝑋1, 𝑋2, …, 𝑋3} is the vector of gene expression values for genes in list G.

With this score it is possible to condensate in a single value both the proportion of expressed signature genes and their overall expression, enabling researchers to easily study whole signatures expression at single cell level.

Raw score

The raw score is the very same score described in the above mentioned paper, it's the default score_mode in signature_score function.

The function both supports signature dictionaries and canonical gmt files (tab separated and without header), provided as filepath or url.

In [19]:
investigate.signature_score(data=atlas, signatures_dict=gmt, score_mode='raw')
Checking for genes not in AnnData.raw.var_names ...

All signature genes are in AnnData.raw.var_names

Computing raw signature scores ...

"B" added in Anndata.obs
"CD4 T" added in Anndata.obs
"CD8 T" added in Anndata.obs
"DC" added in Anndata.obs
"Mono" added in Anndata.obs
"NK" added in Anndata.obs
"Platelet" added in Anndata.obs
In [20]:
investigate.signature_score(data=atlas, signatures_dict='./data/atlas.gmt', score_mode='raw')
Checking for genes not in AnnData.raw.var_names ...

All signature genes are in AnnData.raw.var_names

Computing raw signature scores ...

"B" added in Anndata.obs
"CD4 T" added in Anndata.obs
"CD8 T" added in Anndata.obs
"DC" added in Anndata.obs
"Mono" added in Anndata.obs
"NK" added in Anndata.obs
"Platelet" added in Anndata.obs

Signature scores are stored by default in AnnData.obs, adding a column for each signature, named accordingly to the signature name.

In [21]:
atlas.obs[gmt.keys()]
Out[21]:
B CD4 T CD8 T DC Mono NK Platelet
L1_AAACCCAAGAAACTCA 0.003391 0.000000 0.000000 0.018421 0.117444 0.002097 0.008695
L1_AAACCCAAGACATACA 0.000906 0.017159 0.005055 0.000544 0.004802 0.005720 0.004066
L1_AAACCCACAACTGGTT 0.001058 0.013799 0.007978 0.000933 0.003893 0.004422 0.005280
L1_AAACCCACACGTACTA 0.000734 0.002076 0.003119 0.000889 0.007379 0.042431 0.006687
L1_AAACCCACAGCATACT 0.000741 0.009486 0.008539 0.000435 0.005705 0.004803 0.004923
... ... ... ... ... ... ... ...
E2L8_TTTGTTGGTCGTGATT 0.001177 0.017618 0.008669 0.001345 0.009061 0.007195 0.004739
E2L8_TTTGTTGGTGTGCCTG 0.004101 0.000249 0.000113 0.012261 0.149902 0.003857 0.013031
E2L8_TTTGTTGGTTAGTTCG 0.035860 0.001166 0.000064 0.010239 0.021855 0.002644 0.006830
E2L8_TTTGTTGGTTGGCTAT 0.004736 0.000176 0.000038 0.020185 0.117610 0.005775 0.008069
E2L8_TTTGTTGTCTCATGGA 0.003600 0.000259 0.000110 0.015657 0.121061 0.002931 0.010978

154975 rows × 7 columns

Alternatively, signature scores can be directly returned as an array by specifying return_array=True.

In [22]:
investigate.signature_score(data=atlas, signatures_dict=gmt, score_mode='raw', return_array=True)
Out[22]:
array([[3.39127186e-03, 0.00000000e+00, 0.00000000e+00, ...,
        1.17443536e-01, 2.09689546e-03, 8.69495659e-03],
       [9.06495660e-04, 1.71590178e-02, 5.05521247e-03, ...,
        4.80166486e-03, 5.72036737e-03, 4.06636299e-03],
       [1.05752930e-03, 1.37988153e-02, 7.97761599e-03, ...,
        3.89274238e-03, 4.42216507e-03, 5.27998382e-03],
       ...,
       [3.58597904e-02, 1.16567592e-03, 6.43309634e-05, ...,
        2.18553844e-02, 2.64438800e-03, 6.82981665e-03],
       [4.73581858e-03, 1.76191316e-04, 3.82637706e-05, ...,
        1.17609661e-01, 5.77541047e-03, 8.06860405e-03],
       [3.60028756e-03, 2.59292776e-04, 1.10356028e-04, ...,
        1.21061250e-01, 2.93086665e-03, 1.09778654e-02]])

Scaled score

Scaled score is the raw score divided by max score value, operation which rescales the values from 0 to 1. Scaled scores of different signatures, with different length, can thus be directly compared. To compute the 'scaled' score, score_mode must be set as 'scaled'.

To assess the capability of extracted signatures to discriminate cell types, we compared scaled score distributions.

In [23]:
investigate.signature_score(data=atlas, signatures_dict=gmt, score_mode='scaled')
Checking for genes not in AnnData.raw.var_names ...

All signature genes are in AnnData.raw.var_names

Computing scaled signature scores ...

"B" added in Anndata.obs
"CD4 T" added in Anndata.obs
"CD8 T" added in Anndata.obs
"DC" added in Anndata.obs
"Mono" added in Anndata.obs
"NK" added in Anndata.obs
"Platelet" added in Anndata.obs
In [24]:
sc.pl.umap(atlas, color='Cell type')

By inspecting the score values, for all the signatures, the highest values are found in the proper cluster, indicating the sensitivity of the signatures and the capability of the signature score to represent the expression of the whole gene lists.

In [25]:
sc.pl.umap(atlas, color=gmt.keys(), color_map='Reds')

CD4 T and CD8 T scores, as expected, show an overlap reflecting their similarity. In particular both signatures are highly expressed in CD8 Naive subcluster.

In [26]:
ax=sc.pl.umap(atlas, show=False )
sc.pl.umap(atlas[(atlas.obs['celltype.l2']=='CD8 Naive')], color='celltype.l2', ax=ax)

Inspecting the violin plots showing the distribution of score values, as expected, it seems that there is a cluster in which values are higher than all the others for each signature.

In [27]:
sc.pl.violin(atlas, keys=gmt.keys(), groupby='Cell type')

To better viusualize those distributions, we exploited grouped_distributions. By selecting AnnData.obs columns containing signature scores, this function plots a heatmap showing the medians of their values in cell groups (each of the above shown violin plot information is condensed in a heatmap column) and it prints a statistical report. For each cell group, a two-sided Wilcoxon test is perfomed to evaluate if the distribution with the highest median is different from the others. For each signature, a two-sided Mann-Whitney U test is performed to evaluate if the distribution in the cell group having the highest median is different from the other groups distributions.

In [29]:
report.grouped_distributions(atlas, groups_obs='Cell type', columns_obs=gmt.keys(), scale_medians='column-wise', cmap='Reds')
Performing Wilcoxon test on each cell group ...
For each cell group there is a distribution significantly higher than the others (p<0.01)

Performing Mann-Whitney U test on each selected AnnData.obs column ...
For each distribution, there is only a cell group in which values are higher with respect to all the other groups  (p<0.01)
Out[29]:
<AxesSubplot: ylabel='Cell type'>

The statistical tests confirmed that the visible differences in signature score distributions are significant, indicating that scaled signature scores are consistent with authors annotation. With the evidence of the goodness of the signatures, we proceeded with the classification of PBMC3K.

Signature-based classification

To classify the test dataset we used signature_based_classification. Two main classification modalities have been implemented: a fast classification mode, in which scaled scores are directly computed and compared, and a FC score-based mode that, relying on the comparison between raw signature scores and random signature scores, can provide a more confident classification at the cost of computation time.

Fast classification

In [30]:
report.grouped_distributions(atlas, columns_obs=list(gmt.keys()), groups_obs='Cell type', cmap='Reds', scale_medians='column-wise')
Performing Wilcoxon test on each cell group ...
For each cell group there is a distribution significantly higher than the others (p<0.01)

Performing Mann-Whitney U test on each selected AnnData.obs column ...
For each distribution, there is only a cell group in which values are higher with respect to all the other groups  (p<0.01)
Out[30]:
<AxesSubplot: ylabel='Cell type'>

Fast classification is performed by assigning to each cell the label of the signature with the max scaled score value. Being based on matrices and vectors operations, this computation is very fast.

In [31]:
investigate.signature_based_classification(data=pbmc3k, signatures_dict='./data/atlas.gmt', 
                                           fast_mode=True, obs_name='Prediction fast mode')
14/115 of "B" signature genes were removed since they are not in AnnData.raw.var_names
3/43 of "CD4 T" signature genes were removed since they are not in AnnData.raw.var_names
2/22 of "CD8 T" signature genes were removed since they are not in AnnData.raw.var_names
5/156 of "DC" signature genes were removed since they are not in AnnData.raw.var_names
25/674 of "Mono" signature genes were removed since they are not in AnnData.raw.var_names
4/130 of "NK" signature genes were removed since they are not in AnnData.raw.var_names
9/132 of "Platelet" signature genes were removed since they are not in AnnData.raw.var_names

Classification labels added in AnnData.obs["Prediction fast mode"]

Runtime of the process is 0.23 s 
Out[31]:
'Fast classification complete!'
In [32]:
sc.pl.umap(pbmc3k, color=gmt, cmap='Reds', save='_scaled_scores.pdf')
WARNING: saving figure to file figures/umap_scaled_scores.pdf
In [33]:
report.grouped_distributions(pbmc3k, groups_obs='Cell type', columns_obs=['CD4 T', 'B', 'CD8 T', 'NK', 'Mono', 'DC', 'Platelet'], scale_medians='column-wise', cmap='Reds')
Performing Wilcoxon test on each cell group ...
WARNING in cell group DC: DC values are not significantly different from Mono values.

Performing Mann-Whitney U test on each selected AnnData.obs column ...
For each distribution, there is only a cell group in which values are higher with respect to all the other groups  (p<0.01)
Out[33]:
<AxesSubplot: ylabel='Cell type'>

The signature score distribution of Dendritic Cells (DC) and Monocytes (Mono) were not significantly different within the manually annotated dendritic cells cluster (Wilcoxon test). However, since for each distribution, there is only a cell group in which values are significantly higher with respect to all the other groups (Mann-Whitney U test), the simple inspection of PBMC3K UMAP showing the different signature scores is sufficient to highlight the sought cell populations. This confirmed the consistency of the signatures across datasets and suggested that the scaled score could be per se sufficient to help researchers in an eventual annotation, permitting to skip the one-by-one evaluation of DEGs reported in literature as marker genes.

By inspecting and comparing the ground truth and the predicted labels is evident that the vast majority of the cells was correctly classified.

In [34]:
pbmc3k.uns['Prediction fast mode_colors']= ['#fb9a99', '#b2df8a', '#1f78b4', '#d62728', '#6fadfd', '#E4D00A','#FF5733']
In [35]:
sc.pl.umap(pbmc3k, color=['Cell type','Prediction fast mode'], wspace=0.5)
In [36]:
report.group_composition(pbmc3k, classification_obs='Prediction fast mode', groups_obs='Cell type',
                  columns_order=['CD4 T', 'B', 'CD8 T', 'NK', 'Mono', 'DC', 'Platelet'], cmap='Greens')
Out[36]:
<AxesSubplot: xlabel='Prediction fast mode', ylabel='Cell type'>

FC score-based classification

FC score-based classification is performed by looking for the max Fold Change score value for each cell instead of scaled score one. FC score is computed by dividing the raw signatue score by the median of raw scores of a given number (n_iter) of random signatures. To generate random signatures, all the genes of the dataset are ranked by their mean expression and are assigned to a given bin, and for each signature gene a random gene of the same expression bin is picked.

Standard mode

Standard mode is performed with default parameters: all the FC scores lower than 1 (random score < signature score) are turned into 0, and if a cell has a FC score equal to 0 for all the signature, it will not be assigned to any cell type. This approach is slower with respect to fast mode and it could be affected by a decrease of performances due to unassigned cells, but in this way the classified cells are labelled with an higher confidence and cells with low FCs (e.g. not fully differentiated) are not forced to be classified. To speed-up the process it's possible to parallelize the computations allocating jobs to a given number of processors (n_proc). By setting new_score='FC', raw scores in AnnData.obs can be replaced by FC values.

In [37]:
investigate.signature_based_classification(data=pbmc3k, signatures_dict=gmt,
                    n_iter=500, n_proc=32,new_score='filtered', obs_name='Prediction standard mode', FC_threshold= 1, n_bins=25)
Checking for genes not in AnnData.raw.var_names ...

14/115 of "B" signature genes were removed since they are not in AnnData.raw.var_names
3/43 of "CD4 T" signature genes were removed since they are not in AnnData.raw.var_names
2/22 of "CD8 T" signature genes were removed since they are not in AnnData.raw.var_names
5/156 of "DC" signature genes were removed since they are not in AnnData.raw.var_names
25/674 of "Mono" signature genes were removed since they are not in AnnData.raw.var_names
4/130 of "NK" signature genes were removed since they are not in AnnData.raw.var_names
9/132 of "Platelet" signature genes were removed since they are not in AnnData.raw.var_names

Computing raw signature scores ...

"B" added in Anndata.obs
"CD4 T" added in Anndata.obs
"CD8 T" added in Anndata.obs
"DC" added in Anndata.obs
"Mono" added in Anndata.obs
"NK" added in Anndata.obs
"Platelet" added in Anndata.obs
Number of chunks: 32

Classification labels added in AnnData.obs["Prediction standard mode"]

Results have been stored in AnnData.uns["signature_based_classification"]

raw scores are being replaced by filtered Fold Change signature scores ...


Runtime of the process is 0.29 min with 32 cores

Beside the possibility to substitute raw score values in AnnData.obs, by default both FC scores and filtered FC scores are stored in AnnData.uns.

In [38]:
pbmc3k.uns["signature_based_classification"]
Out[38]:
B_FC CD4 T_FC CD8 T_FC DC_FC Mono_FC NK_FC Platelet_FC B_thr CD4 T_thr CD8 T_thr ... Mono_thr NK_thr Platelet_thr B_filtered CD4 T_filtered CD8 T_filtered DC_filtered Mono_filtered NK_filtered Platelet_filtered
0 0.851506 2.486842 4.570688e+00 0.334143 0.508884 2.031563 0.879401 1.0 1.0 1.0 ... 1.0 1.0 1.0 0.000000 2.486842 4.570688 0.000000 0.000000 2.031563 0.000000
1 8.606316 0.284275 1.909269e-01 2.184825 0.812732 0.590000 1.273945 1.0 1.0 1.0 ... 1.0 1.0 1.0 8.606316 0.000000 0.000000 2.184825 0.000000 0.000000 1.273945
2 0.241651 3.842619 1.307566e+00 0.217219 0.575585 0.681604 1.241158 1.0 1.0 1.0 ... 1.0 1.0 1.0 0.000000 3.842619 1.307566 0.000000 0.000000 0.000000 1.241158
3 1.057598 0.018496 4.720653e-02 2.196682 4.373617 0.513429 1.171368 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.057598 0.000000 0.000000 2.196682 4.373617 0.000000 1.171368
4 0.352329 0.213158 3.519879e+00 1.663561 0.354720 15.888616 1.141829 1.0 1.0 1.0 ... 1.0 1.0 1.0 0.000000 0.000000 3.519879 1.663561 0.000000 15.888616 1.141829
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2633 1.253719 0.011487 3.053506e-02 4.202156 4.395560 0.378056 0.830515 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.253719 0.000000 0.000000 4.202156 4.395560 0.000000 0.000000
2634 4.065053 0.723743 1.827626e-02 1.418572 0.880338 0.581645 1.395401 1.0 1.0 1.0 ... 1.0 1.0 1.0 4.065053 0.000000 0.000000 1.418572 0.000000 0.000000 1.395401
2635 26.878841 0.358261 1.048104e-42 4.868712 0.983395 0.236162 0.509639 1.0 1.0 1.0 ... 1.0 1.0 1.0 26.878841 0.000000 0.000000 4.868712 0.000000 0.000000 0.000000
2636 15.592578 0.026575 9.123587e-43 2.884721 0.716081 0.760254 0.462967 1.0 1.0 1.0 ... 1.0 1.0 1.0 15.592578 0.000000 0.000000 2.884721 0.000000 0.000000 0.000000
2637 0.947321 4.827974 2.344116e+00 0.558299 0.502004 0.370517 2.344615 1.0 1.0 1.0 ... 1.0 1.0 1.0 0.000000 4.827974 2.344116 0.000000 0.000000 0.000000 2.344615

2638 rows × 21 columns

In [39]:
df=pbmc3k.uns['signature_based_classification']
df=df.iloc[:, 14:21]
df.index=pbmc3k.obs.index
pbmc3k.obs[df.columns]=df
In [40]:
filtered_FC_pbmc3k=['CD4 T_filtered','B_filtered','CD8 T_filtered', 'NK_filtered','Mono_filtered','DC_filtered','Platelet_filtered']

By inspecting the distributions of filtered FC scores, it is even more evident that each signature is preferentially expressed by one cell group.

In [41]:
report.grouped_distributions(pbmc3k, columns_obs=filtered_FC_pbmc3k, groups_obs='Cell type', cmap='Reds', scale_medians='column-wise')
Performing Wilcoxon test on each cell group ...
For each cell group there is a distribution significantly higher than the others (p<0.01)

Performing Mann-Whitney U test on each selected AnnData.obs column ...
For each distribution, there is only a cell group in which values are higher with respect to all the other groups  (p<0.01)
Out[41]:
<AxesSubplot: ylabel='Cell type'>

Indeed, the majority of each cell group has been correctly classified.

In [42]:
pbmc3k.uns['Prediction standard mode_colors']= ['#fb9a99', '#b2df8a', '#1f78b4', '#d62728', '#6fadfd', '#E4D00A',
                                                '#FF5733', '#762a83']
In [43]:
sc.pl.umap(pbmc3k,color=['Cell type', 'Prediction standard mode'], wspace=0.5)
In [44]:
report.group_composition(pbmc3k, classification_obs='Prediction standard mode', groups_obs='Cell type',
                  columns_order=['CD4 T', 'B', 'CD8 T', 'NK', 'Mono', 'DC', 'Platelet'], cmap='Greens')
Out[44]:
<AxesSubplot: xlabel='Prediction standard mode', ylabel='Cell type'>

No major differences have been detected between fast and standard classification in both datasets, suggesting a per se validity of scaled score to perform a fast explorative cell type prediction.

Quantile-based filtering mode

Quantile-based filter adds another layer of stringency to the classification by acting on FC distributions. By setting q equal to a number from 0 to 1, all the FC score values below that given quantile are set 0. In this way only the highest values of FC are mainteined and so only the most signature-expressing cells will be classified. This could be very useful in case of bimodal distribution of score values or when a signature is moderately expressed in all the clusters.

In [45]:
investigate.signature_based_classification(data=pbmc3k, signatures_dict=gmt,
                    n_iter=500, n_proc=32, obs_name='Prediction q', q=0.50, n_bins=25)
Checking for genes not in AnnData.raw.var_names ...

14/115 of "B" signature genes were removed since they are not in AnnData.raw.var_names
3/43 of "CD4 T" signature genes were removed since they are not in AnnData.raw.var_names
2/22 of "CD8 T" signature genes were removed since they are not in AnnData.raw.var_names
5/156 of "DC" signature genes were removed since they are not in AnnData.raw.var_names
25/674 of "Mono" signature genes were removed since they are not in AnnData.raw.var_names
4/130 of "NK" signature genes were removed since they are not in AnnData.raw.var_names
9/132 of "Platelet" signature genes were removed since they are not in AnnData.raw.var_names

Computing raw signature scores ...

"B" added in Anndata.obs
"CD4 T" added in Anndata.obs
"CD8 T" added in Anndata.obs
"DC" added in Anndata.obs
"Mono" added in Anndata.obs
"NK" added in Anndata.obs
"Platelet" added in Anndata.obs
Number of chunks: 32

Classification labels added in AnnData.obs["Prediction q"]

Results have been stored in AnnData.uns["signature_based_classification"]

Runtime of the process is 0.3 min with 32 cores
In [46]:
pbmc3k.uns['Prediction q_colors']=['#fb9a99', '#b2df8a', '#1f78b4', '#d62728', '#6fadfd', '#E4D00A',
       '#FF5733', '#762a83']
In [47]:
sc.pl.umap(pbmc3k,color=['Cell type', 'Prediction q'], wspace=0.5)
In [48]:
df=pbmc3k.uns['signature_based_classification']
df=df.iloc[:, 14:21]
df.index=pbmc3k.obs.index
pbmc3k.obs[df.columns]=df

report.grouped_distributions(pbmc3k, columns_obs=filtered_FC_pbmc3k, groups_obs='Cell type', cmap='Reds', scale_medians='column-wise')
Performing Wilcoxon test on each cell group ...
For each cell group there is a distribution significantly higher than the others (p<0.01)

Performing Mann-Whitney U test on each selected AnnData.obs column ...
For each distribution, there is only a cell group in which values are higher with respect to all the other groups  (p<0.01)
Out[48]:
<AxesSubplot: ylabel='Cell type'>
In [49]:
report.group_composition(pbmc3k, classification_obs='Prediction q', groups_obs='Cell type',
                  columns_order=['CD4 T', 'B', 'CD8 T', 'NK', 'Mono', 'DC', 'Platelet', 'Unassigned'], cmap='Greens')
Out[49]:
<AxesSubplot: xlabel='Prediction q', ylabel='Cell type'>

Again, no major differences have been found, indicating the consistency of the signatures. As expected, the clusters with the highest variability are CD8 T and CD4 T, which, as mentioned before, have aspecific signatures.

p value-based filtering mode

The most stringent level of classification is given by p-val mode. As mentioned before, FC score is obtained by dividing raw signature scores for the median of scores of n_iter * randomic signatures. For each signature, for each cell, a p-value is calculated by counting how many times random signature values are higher than gene signature one (all divided by n_inter). If p value is higher than the set p parameter, the FC score of that cell for that signature will become 0. In this way only the significantly signature-expressing cells will be assigned.

In [50]:
investigate.signature_based_classification(data=pbmc3k, signatures_dict=gmt,
                    n_iter=500, n_proc=32, obs_name='Prediction p-val', p=0.05, n_bins=25, new_score='FC')
Checking for genes not in AnnData.raw.var_names ...

14/115 of "B" signature genes were removed since they are not in AnnData.raw.var_names
3/43 of "CD4 T" signature genes were removed since they are not in AnnData.raw.var_names
2/22 of "CD8 T" signature genes were removed since they are not in AnnData.raw.var_names
5/156 of "DC" signature genes were removed since they are not in AnnData.raw.var_names
25/674 of "Mono" signature genes were removed since they are not in AnnData.raw.var_names
4/130 of "NK" signature genes were removed since they are not in AnnData.raw.var_names
9/132 of "Platelet" signature genes were removed since they are not in AnnData.raw.var_names

Computing raw signature scores ...

"B" added in Anndata.obs
"CD4 T" added in Anndata.obs
"CD8 T" added in Anndata.obs
"DC" added in Anndata.obs
"Mono" added in Anndata.obs
"NK" added in Anndata.obs
"Platelet" added in Anndata.obs
Number of chunks: 32

Classification labels added in AnnData.obs["Prediction p-val"]

Results have been stored in AnnData.uns["signature_based_classification"]

raw scores are being replaced by Fold Change signature scores ...


Runtime of the process is 0.3 min with 32 cores
In [51]:
pbmc3k.uns['signature_based_classification']
Out[51]:
B_FC CD4 T_FC CD8 T_FC DC_FC Mono_FC NK_FC Platelet_FC B_thr CD4 T_thr CD8 T_thr ... Mono_thr NK_thr Platelet_thr B_filtered CD4 T_filtered CD8 T_filtered DC_filtered Mono_filtered NK_filtered Platelet_filtered
0 0.851506 2.486842 4.570688e+00 0.334143 0.508884 2.031563 0.879401 0.003649 0.004521 0.003957 ... 0.013065 0.005427 0.004545 0.000000 2.486842 4.570688 0.000000 0.000000 2.031563 0.000000
1 8.606316 0.284275 1.909269e-01 2.184825 0.812732 0.590000 1.273945 0.004090 0.006752 0.005446 ... 0.019796 0.008938 0.006759 8.606316 0.000000 0.000000 2.184825 0.000000 0.000000 1.273945
2 0.241651 3.842619 1.307566e+00 0.217219 0.575585 0.681604 1.241158 0.003903 0.005278 0.004387 ... 0.016536 0.007604 0.005334 0.000000 3.842619 0.000000 0.000000 0.000000 0.000000 0.000000
3 1.057598 0.018496 4.720653e-02 2.196682 4.373617 0.513429 1.171368 0.004585 0.007707 0.006604 ... 0.011955 0.009093 0.007689 0.000000 0.000000 0.000000 2.196682 4.373617 0.000000 1.171368
4 0.352329 0.213158 3.519879e+00 1.663561 0.354720 15.888616 1.141829 0.002952 0.005074 0.003497 ... 0.009593 0.004040 0.004381 0.000000 0.000000 0.000000 0.000000 0.000000 15.888616 0.000000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2633 1.253719 0.011487 3.053506e-02 4.202156 4.395560 0.378056 0.830515 0.005636 0.009159 0.007444 ... 0.015755 0.011483 0.009344 1.253719 0.000000 0.000000 4.202156 4.395560 0.000000 0.000000
2634 4.065053 0.723743 1.827626e-02 1.418572 0.880338 0.581645 1.395401 0.003018 0.004871 0.003946 ... 0.013887 0.006005 0.005107 4.065053 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
2635 26.878841 0.358261 1.048104e-42 4.868712 0.983395 0.236162 0.509639 0.001959 0.004323 0.003900 ... 0.009257 0.005423 0.004382 26.878841 0.000000 0.000000 4.868712 0.000000 0.000000 0.000000
2636 15.592578 0.026575 9.123587e-43 2.884721 0.716081 0.760254 0.462967 0.002799 0.004361 0.004252 ... 0.009250 0.004730 0.004570 15.592578 0.000000 0.000000 2.884721 0.000000 0.000000 0.000000
2637 0.947321 4.827974 2.344116e+00 0.558299 0.502004 0.370517 2.344615 0.002938 0.003745 0.003789 ... 0.009254 0.004764 0.003940 0.000000 4.827974 0.000000 0.000000 0.000000 0.000000 2.344615

2638 rows × 21 columns

In [52]:
pbmc3k.uns['Prediction p-val_colors']=['#fb9a99', '#b2df8a', '#1f78b4', '#d62728', '#6fadfd', '#E4D00A',
       '#FF5733', '#762a83']
In [53]:
sc.pl.umap(pbmc3k, color=['Cell type','Prediction p-val'], wspace=0.5)
In [54]:
sc.pl.umap(pbmc3k, color=filtered_FC_pbmc3k, cmap='Reds')
In [55]:
df=pbmc3k.uns['signature_based_classification']
df=df.iloc[:, 14:21]
df.index=pbmc3k.obs.index
pbmc3k.obs[df.columns]=df

report.grouped_distributions(pbmc3k, columns_obs=filtered_FC_pbmc3k, groups_obs='Cell type', cmap='Reds', scale_medians='column-wise')
plt.savefig('figures/heatmap_pval.pdf')
Performing Wilcoxon test on each cell group ...
For each cell group there is a distribution significantly higher than the others (p<0.01)

Performing Mann-Whitney U test on each selected AnnData.obs column ...
For each distribution, there is only a cell group in which values are higher with respect to all the other groups  (p<0.01)
In [56]:
report.group_composition(pbmc3k, classification_obs='Prediction p-val', groups_obs='Cell type',
                  columns_order=['CD4 T', 'B', 'CD8 T', 'NK', 'Mono', 'DC', 'Platelet', 'Unassigned'], cmap='Greens')
Out[56]:
<AxesSubplot: xlabel='Prediction p-val', ylabel='Cell type'>

Again, no major differences were found with the statistically relevant p value mode, suggesting the consistency of our classification method.

Classification performance evaluation

To evaluate classification performances both per ground truth cluster and overall we exploited respectively grouped_classification_metrics and classification_metrics functions.

In both functions cell labels assigned by CIA and the annotation already present in test datasets are compared in order to count true positive (TP), true negative (TN), false positive (FP) and false negative (FN) cells for each cluster. Only for the overall calculation the per-cluster counts are summed to obtain the total TN, TP, FN and FP.

Then, again for both functions, the following metrics are calculated:

  • Sensitivity (SE)= TP/(TP+FN)
  • Specificity (SP)= TN/(TN+FP)
  • Precision (PR)= TP/(TP+FP)
  • Accuracy (ACC)= (TN+TP)/(TN+TP+FN+FP)
  • F1-score (F1)= 2TP/(2TP+FN+FP)

N.B.: the column of the classification of interest and the one with ground truth labels must have the same categories to be compared.

Since FC-based classification exploits random signatures generation classification results may slightly differ each time the kernel is restarted. For this reason we saved the labels of the classification in 'data/pbmc3k_classified.h5ad'. From now on, for reproducibility pourposes, we use those results to calculate performances.

In [60]:
pbmc3k=sc.read('data/pbmc3k_classified.h5ad')

Here, for clarity, we show only the per-cluster classification metrics of the statistically relevant p value-based method for both datasets.

In [61]:
report.grouped_classification_metrics(pbmc3k, classification_obs='Prediction p-val',groups_obs='Cell type')
Out[61]:
SE SP PR ACC F1
CD4 T 0.755245 0.983936 0.972973 0.884761 0.850394
B 0.994152 0.996951 0.979827 0.996588 0.986938
CD8 T 0.756329 0.941430 0.637333 0.919257 0.691751
NK 0.974026 0.979066 0.742574 0.978772 0.842697
Mono 0.976190 0.996016 0.987159 0.991281 0.981644
DC 0.945946 0.994233 0.700000 0.993556 0.804598
Platelet 0.933333 0.991994 0.400000 0.991660 0.560000

And here are reported the overall perfomances of each classification modality:

In [62]:
report.classification_metrics(pbmc3k, 
                       classification_obs=['Prediction fast mode', 'Prediction standard mode', 'Prediction q', 'Prediction p-val'],
                       groups_obs='Cell type')
Out[62]:
SE SP PR ACC F1
Prediction fast mode 0.915845 0.985974 0.915845 0.975956 0.915845
Prediction standard mode 0.886277 0.981046 0.886277 0.967508 0.886277
Prediction q 0.884382 0.980983 0.885725 0.967183 0.885053
Prediction p-val 0.855572 0.983384 0.895635 0.965125 0.875145

Conclusions

Signature score is confirmed to be a powerful way to condensate the information of a whole gene signature expression at single cell level. The only differenceres from the expected score distributions were attributed to the high transcriptional similarity (reported by the authors themselves) of some PBMC atlas cell group (from which we extracted the signatures), which have been previously defined also exploiting protein level information. This further suggested the capability of this score to clearly highlight the goodness of the transcriptional signature itself.

Signature based classification resulted to be effective in the automatic annotation of scRNA-seq datasets with external signatures. In particular, considering the overall performances of CIA in the PBMC3K test dataset:

  • our method is able to rapidly classify a scRNA-seq dataset using independent gene signatures and known negative markers without any clustering step;
  • all the shown classification modalities annotated cells with good performances (lowest ACC: 91.58%, lowest F1:87.61%), suggesting that fast classification can be used as explorative analysis to infer cell idendity before a run with the more computationally intense FC-based mode.

In our package, besides the classification tool, we also implemented a module of functions which allows to easily compare classification methods and evaluate score distributions in cell groups (also obtained with other packages).

To visualize how CIA performs with respect to other classifiers compatible with Scanpy, see CIA benchmarking tutorial.